# Multimodal Agent

GUI Actor 7B Qwen2 VL

GUI-Actor-7B is a vision-language model built on Qwen2-VL-7B-Instruct. It targets graphical user interface (GUI) agent tasks and provides coordinate-free visual grounding.

Tags: Multimodal Fusion

Qwen2.5 VL 7B Instruct GGUF

Qwen2.5-VL is the latest vision-language model from the Qwen family, featuring powerful visual understanding and multimodal processing capabilities, supporting image and video analysis with structured output.

Tags: Image-to-Text, English

Gemma 3 R1984 4B

Gemma3-R1984-4B is a powerful agent AI platform built upon Google's Gemma-3-4B model, supporting multimodal file processing and deep research capabilities.

Tags: Transformers, Supports Multiple Languages

VideoMind is a multimodal agent framework that enhances video reasoning by simulating human thought processes.

Magma is a foundational multimodal AI agent model capable of processing image and text inputs to generate text outputs, with complex interaction abilities in both virtual and real-world environments.

OmniParser V2.0

OmniParser is a universal screen parsing tool that converts UI screenshots into structured formats to improve the performance of LLM-based UI agents.

Qwen2.5 VL 3B Instruct 4bit

Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring enhanced visual understanding, agent capabilities, and long video processing.

Tags: Transformers, English

Fuyu-8B is a multimodal text-and-image transformer developed by Adept AI and designed for digital agents. It supports arbitrary image resolutions, responds quickly, and has a streamlined architecture.

© 2025 AIbase